Rows: 15976 Columns: 22
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (12): Date, Source, Site ID, Units, Local Site Name, AQS Parameter Descr...
dbl (10): POC, Daily Mean PM2.5 Concentration, Daily AQI Value, Daily Obs Co...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data2022 <- read_csv("~/Downloads/2022 .csv")
Rows: 59918 Columns: 22
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (12): Date, Source, Site ID, Units, Local Site Name, AQS Parameter Descr...
dbl (10): POC, Daily Mean PM2.5 Concentration, Daily AQI Value, Daily Obs Co...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
summary(data2002$`Daily Mean PM2.5 Concentration`)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 7.00 12.00 16.12 20.50 104.30
summary(data2022$`Daily Mean PM2.5 Concentration`)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-6.700 4.100 6.800 8.414 10.700 302.500
layout(matrix(1:2, nrow = 1))
hist(data2002$`Daily Mean PM2.5 Concentration`)
hist(data2022$`Daily Mean PM2.5 Concentration`)
layout(1)
summary(data2002[,20:22])
    County          Site Latitude   Site Longitude
 Length:15976       Min.   :32.63   Min.   :-124.2
 Class :character   1st Qu.:34.07   1st Qu.:-121.4
 Mode  :character   Median :35.36   Median :-119.1
                    Mean   :36.00   Mean   :-119.4
                    3rd Qu.:37.77   3rd Qu.:-117.9
                    Max.   :41.71   Max.   :-115.5
summary(data2022[,20:22])
    County          Site Latitude   Site Longitude
 Length:59918       Min.   :32.58   Min.   :-124.2
 Class :character   1st Qu.:34.07   1st Qu.:-121.4
 Mode  :character   Median :36.49   Median :-119.6
                    Mean   :36.25   Mean   :-119.6
                    3rd Qu.:37.96   3rd Qu.:-117.9
                    Max.   :41.76   Max.   :-115.5
summary(data2002$`Daily AQI Value`)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 39.00 56.00 59.28 72.00 185.00
summary(data2022$`Daily AQI Value`)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 23.00 38.00 39.22 54.00 454.00
hist(data2002$`Daily AQI Value`)
hist(data2022$`Daily AQI Value`)
table(data2002$County)
Alameda          Butte            Calaveras        Colusa           Contra Costa
201              473              60               95               276
Del Norte        El Dorado        Fresno           Humboldt         Imperial
110              208              760              59               342
Inyo             Kern             Kings            Lake             Los Angeles
277              1800             83               61               1879
Marin            Mariposa         Mendocino        Merced           Modoc
97               290              122              89               2
Mono             Monterey         Nevada           Orange           Placer
111              120              226              470              60
Plumas           Riverside        Sacramento       San Benito       San Bernardino
177              1017             819              119              835
San Diego        San Francisco    San Joaquin      San Luis Obispo  San Mateo
1350             196              124              168              100
Santa Barbara    Santa Clara      Santa Cruz       Shasta           Siskiyou
152              459              61               271              104
Solano           Sonoma           Stanislaus       Sutter           Trinity
97               93               183              114              90
Tulare           Ventura          Yolo
508              556              112
table(data2022$County)
Alameda          Butte            Calaveras        Colusa           Contra Costa
1793             1121             355              401              815
Del Norte        El Dorado        Fresno           Glenn            Humboldt
458              228              2761             351              116
Imperial         Inyo             Kern             Kings            Lake
1625             3186             2333             721              61
Los Angeles      Madera           Marin            Mariposa         Mendocino
5070             360              476              574              701
Merced           Mono             Monterey         Nevada           Orange
719              944              1119             734              879
Placer           Plumas           Riverside        Sacramento       San Benito
1774             1118             4546             2556             478
San Bernardino   San Diego        San Francisco    San Joaquin      San Luis Obispo
2715             4598             361              1016             1427
San Mateo        Santa Barbara    Santa Clara      Santa Cruz       Shasta
356              1247             1182             709              480
Siskiyou         Solano           Sonoma           Stanislaus       Sutter
424              720              348              745              712
Tehama           Trinity          Tulare           Ventura          Yolo
347              402              1223             2152             381
table(data2002$State)
California
15976
table(data2022$State)
California
59918
table(data2002$`Daily Obs Count`)
1
15976
table(data2022$`Daily Obs Count`)
1
59918
table(data2002$`Percent Complete`)
100
15976
table(data2022$`Percent Complete`)
100
59918
table(data2002$`Method Description`)
Andersen RAAS2.5-300 PM2.5 SEQ w/WINS
8651
IMPROVE Module A with Cyclone Inlet-Teflon Filter, 2.2 sq. cm.
1873
Met One SASS/SuperSASS Teflon
1262
Met-One BAM-1020 W/PM2.5 SCC
1417
R & P Model 2000 PM2.5 Sampler w/WINS
2088
R & P Model 2025 PM2.5 Sequential w/WINS
685
table(data2022$`Method Description`)
IMPROVE Module A with Cyclone Inlet-Teflon Filter, 2.2 sq. cm.
2143
Met One BAM-1020 Mass Monitor w/VSCC
29722
Met One BAM-1022 Mass Monitor w/ VSCC or TE-PM2.5C
1775
Met One E-FRM PM2.5 with VSCC
61
Met One E-SEQ-FRM PM2.5 with VSCC
736
Met One SASS/SuperSASS Teflon
634
Met-One BAM W/PM2.5 VSCC
1376
Met-One BAM-1020 W/PM2.5 SCC
8890
Met-one BAM-1022 W/PM2.5 SCC
619
R & P Model 2000 PM-2.5 Air Sampler w/VSCC
516
R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC
5297
Teledyne T640 at 5.0 LPM
1344
Teledyne T640 at 5.0 LPM (Corrected)
1706
Teledyne T640 at 5.0 LPM w/Network Data Alignment enabled
362
Teledyne T640X at 16.67 LPM
1977
Teledyne T640X at 16.67 LPM (Corrected)
2043
Thermo Scientific 5014i or FH62C14-DHS w/VSCC
359
Thermo Scientific TEOM 1400 FDMS or 1405 8500C FDMS w/VSCC
358
summary(data2002$`Local Site Name`)
Length Class Mode
15976 character character
summary(data2022$`Local Site Name`)
Length Class Mode
59918 character character
Description
Both datasets contain the same 22 variables. data2002 has 15,976 rows spanning 01/01/2002–12/31/2002, while data2022 has 59,918 rows spanning 01/01/2022–12/31/2022. All data were collected in California. Both histograms of daily mean PM2.5 concentration are right-skewed. The daily AQI values are also right-skewed in both years; however, the 2002 distribution appears closer to normal than the 2022 distribution, where most values cluster between 0 and 100. The 2022 data set contains negative PM2.5 values (minimum -6.7), which are implausible because PM2.5 concentrations cannot be negative. These values should therefore be set to NA.
The larger 2022 dataset likely reflects a denser monitoring network and/or greater completeness; for example, the 2022 data list 18 measurement methods, compared with 6 in 2002. Madera, Tehama, and Glenn counties do not appear in the 2002 dataset, reflecting changes in network coverage over time. Based on online information, Glenn County was reportedly not monitored at that time because, as a rural area, it had consistently good air quality. Madera County was also not included, even though online sources describe it as having some of the worst air quality in the nation. Modoc County does not appear in the 2022 data. According to internet sources, this may have been due to a large wildfire that significantly affected the county.
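The county comparison above can be checked programmatically. The following is a minimal sketch in base R, assuming two toy vectors of county names (hypothetical stand-ins for `data2002$County` and `data2022$County`); `setdiff()` returns the names that appear in one vector but not the other.

```r
# Toy county vectors (hypothetical stand-ins, not the real data)
counties_2002 <- c("Alameda", "Butte", "Modoc", "Yolo", "Butte")
counties_2022 <- c("Alameda", "Butte", "Glenn", "Madera", "Tehama", "Yolo")

# Counties present in 2022 but absent in 2002
only_2022 <- setdiff(unique(counties_2022), unique(counties_2002))

# Counties present in 2002 but absent in 2022
only_2002 <- setdiff(unique(counties_2002), unique(counties_2022))

print(only_2022)
print(only_2002)
```

With the real data, the same idea would presumably be `setdiff(unique(data2022$County), unique(data2002$County))` and the reverse.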
Description
I combined both data sets using rbind and added a new column for year, then checked the result. 2002 has 15,976 rows and 2022 has 59,918 rows, which matches the original row counts. I also created short latitude and longitude columns, which will make it easier to build the map for Question 3.
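The combining step could look roughly like the sketch below. It uses two small toy data frames (`d2002` and `d2022`, hypothetical stand-ins for the real data sets), and the alias names `lat` and `lon` are assumptions chosen for illustration.

```r
# Toy stand-ins for the two yearly data sets (hypothetical values)
d2002 <- data.frame(County = c("Alameda", "Butte"),
                    `Site Latitude` = c(37.7, 39.7),
                    `Site Longitude` = c(-122.2, -121.8),
                    check.names = FALSE)
d2022 <- data.frame(County = c("Alameda", "Glenn", "Yolo"),
                    `Site Latitude` = c(37.8, 39.5, 38.6),
                    `Site Longitude` = c(-122.3, -122.2, -121.7),
                    check.names = FALSE)

# Tag each data set with its year before stacking the rows
d2002$year <- 2002L
d2022$year <- 2022L
combined_toy <- rbind(d2002, d2022)

# Short aliases for the coordinate columns, convenient for mapping
combined_toy$lat <- combined_toy$`Site Latitude`
combined_toy$lon <- combined_toy$`Site Longitude`

# Row counts per year should match the original data sets
print(table(combined_toy$year))
```

Checking `table()` of the year column against the original row counts is a quick way to confirm that no rows were lost or duplicated by the merge.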
Description
Monitoring coverage in California expanded substantially between 2002 and 2022. On the map, some blue (2022) and red (2002) dots overlap, but there are more blue dots than red dots, meaning there were more monitoring sites in 2022 and therefore more data were collected. The 2022 sites also provide denser statewide coverage.
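The visual impression from the map can be backed up by counting distinct sites per year. This is a minimal sketch on toy data; the `site` and `year` columns are hypothetical stand-ins for `Site ID` and the year column in the combined data.

```r
# Toy site records (hypothetical): several daily rows per site
sites <- data.frame(
  year = c(2002L, 2002L, 2002L, 2022L, 2022L, 2022L, 2022L),
  site = c("A", "A", "B", "A", "B", "C", "C")
)

# Number of distinct monitoring sites in each year
n_sites <- tapply(sites$site, sites$year, function(s) length(unique(s)))
print(n_sites)
```

Counting unique sites rather than rows separates "more monitors" from "more observations per monitor", which the raw row counts alone cannot distinguish.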
Question 4
summary(combined$`Daily Mean PM2.5 Concentration`)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-6.70 4.40 7.60 10.04 12.20 302.50
mean(combined$`Daily Mean PM2.5 Concentration`<0, na.rm =TRUE)
[1] 0.002832899
combined_clean <- combined %>% filter(`Daily Mean PM2.5 Concentration` >= 0)
summary(combined_clean$`Daily Mean PM2.5 Concentration`)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 4.50 7.60 10.07 12.20 302.50
Description
Both Question 1 and Question 4 show that the daily mean PM2.5 concentrations contain negative values, which are implausible because PM2.5 should never be negative, so in this step I dropped all negative values. I stored the result in a new data set called combined_clean so that the original is preserved. I also computed the proportion of implausible (negative) values, which is about 0.28% of observations (0.002832899).
Question 5
library(dplyr)
# Reassign the cleaned data to `combined`, parse Date, and add a year column
combined <- combined_clean %>%
  mutate(Date = as.Date(Date, format = "%m/%d/%Y"),
         year = format(Date, "%Y"))
Level1 State:
library(dplyr)
library(ggplot2)
# Store year as an integer and mask any negative PM2.5 values as NA
state1 <- combined %>%
  mutate(Date = as.Date(Date, "%m/%d/%Y"),
         year = as.integer(format(Date, "%Y")),
         pm = ifelse(`Daily Mean PM2.5 Concentration` < 0, NA, `Daily Mean PM2.5 Concentration`))
ggplot(state1, aes(x = factor(year), y = pm)) +
  geom_violin() +
  labs(x = "Year", y = "Daily PM2.5 (µg/m³)",
       title = "Statewide PM2.5 distributions")
Description
In the statewide PM2.5 distributions, the widest part of each violin marks where observations are most concentrated. That widest part sits lower in 2022 than in 2002, which means daily PM2.5 was lower overall in 2022. At the same time, the 2022 violin has a tall, thin spike extending beyond 200 and even 300 µg/m³, indicating outliers with very high daily mean PM2.5 concentrations, so we need to examine the county level to see which counties have those extreme values.
Level2 County:
library(dplyr)
library(ggplot2)
combined %>%
  mutate(pm = ifelse(`Daily Mean PM2.5 Concentration` < 0, NA, `Daily Mean PM2.5 Concentration`)) %>%
  filter(!is.na(pm)) %>%
  ggplot() +
  geom_violin(mapping = aes(x = factor(year), y = pm, fill = factor(year))) +
  facet_wrap(~ County, nrow = 3) +
  labs(x = "Year", y = "Daily PM2.5 (µg/m³)",
       title = "PM2.5 distributions by county, 2002 vs 2022")
Description
The county-level violin plots confirm the finding from Question 1 that Glenn, Madera, Modoc, and Tehama are each missing one year of data. Overall, daily mean PM2.5 concentrations decreased from 2002 to 2022, as indicated by the blue (2022) violins sitting lower and narrower than the red (2002) ones. Urban areas show a general decreasing trend. However, there are exceptions in El Dorado, Imperial, Mariposa, Mono, Nevada, Placer, Plumas, Riverside, Siskiyou, and Trinity counties, where tall spikes extend above 100 µg/m³ and in some cases above 200 µg/m³. Several factors may explain these patterns. First, wildfire smoke, which is common in the summer, is likely to increase PM2.5 levels. In addition, geographic features such as mountain basins and foothills can trap smoke under inversions, resulting in episodes of high PM2.5 concentrations. To check this, I filtered the data to months 6-9, the summer months when fires are most likely, and the same pattern appeared, which points to wildfire smoke as the driver in those areas.
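The summer-month check described above is not shown in the code, so here is a minimal sketch of how it might be done in base R. The toy data frame and its `pm` column are hypothetical; the date format `%m/%d/%Y` follows the format used elsewhere in this report.

```r
# Toy daily records (hypothetical values)
toy <- data.frame(
  Date = c("01/15/2022", "06/30/2022", "07/31/2022", "09/01/2022", "10/05/2022"),
  pm   = c(5.2, 40.1, 150.3, 88.0, 12.4)
)

# Parse the date strings and extract the month as an integer
toy$Date  <- as.Date(toy$Date, format = "%m/%d/%Y")
toy$month <- as.integer(format(toy$Date, "%m"))

# Keep only June through September
summer <- toy[toy$month %in% 6:9, ]
print(summer)
```

Re-drawing the county violins on a subset like `summer` would show whether the tall spikes persist when only the fire-season months are kept.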
Level3 City:
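# Add an integer year, a non-negative pm column, and short aliases for coordinates and site ID.
# As written, no County filter is applied here, so `la` still contains every county.
la <- combined %>%
  mutate(Date = as.Date(Date, format = "%m/%d/%Y"),
         year = as.integer(format(Date, "%Y")),
         pm = ifelse(`Daily Mean PM2.5 Concentration` < 0, NA_real_, `Daily Mean PM2.5 Concentration`),
         lat = `Site Latitude`,
         lon = `Site Longitude`,
         site = `Site ID`)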
summary(la$`Daily Mean PM2.5 Concentration`)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 4.50 7.60 10.07 12.20 302.50
library(ggplot2)
ggplot(la) +
  geom_violin(aes(x = factor(year), y = pm, fill = factor(year))) +
  labs(x = "Year", y = "Daily PM2.5 (µg/m³)",
       title = "Los Angeles County: daily PM2.5 distributions by monitoring site")
Description
This step is meant to isolate the data points from Los Angeles County. In the summary above, the daily mean PM2.5 concentration ranges from 0 to 302.50 µg/m³, the same range as the statewide data. As in the statewide plot, the bulk of the 2022 distribution sits lower than in 2002, but a tall spike indicates outliers. Looking at the data, the highest value falls on 7/31/2022, and the next few highest values fall in August and September, which makes sense because, according to online sources, most wildfires happen from July through October.
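One way to look up the dates of the highest readings, as described above, is to sort by concentration and keep the top rows. This is a sketch on toy data: the 302.5 value and 7/31 date echo the text, while the remaining rows are hypothetical.

```r
# Toy daily records (hypothetical apart from the top value and its date)
toy <- data.frame(
  Date = as.Date(c("2022-01-05", "2022-07-31", "2022-08-02", "2022-09-10")),
  pm   = c(8.1, 302.5, 120.0, 95.5)
)

# Sort from highest to lowest concentration and keep the top three rows
top <- head(toy[order(toy$pm, decreasing = TRUE), ], 3)
print(top)
```

Inspecting the `Date` column of `top` then shows directly whether the extreme values cluster in the fire-season months.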
The primary question is whether daily concentrations of PM2.5 (particulate matter air pollution with an aerodynamic diameter less than 2.5 µm) decreased in California over the 20 years from 2002 to 2022. The answer is that most counties experienced a decrease in daily PM2.5 concentrations. However, some counties recorded even higher levels, particularly where wildfires are prevalent or where industrial activity is concentrated. In addition, dry, desert-like atmospheric conditions can contribute to elevated PM2.5 levels.
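The county-by-county conclusion can be summarized numerically by comparing medians between the two years. The sketch below uses a toy data frame with hypothetical counties "A" and "B"; the median is chosen here as an assumption because it is less sensitive to wildfire spikes than the mean.

```r
# Toy data (hypothetical): two counties observed in both years
toy <- data.frame(
  County = c("A", "A", "A", "A", "B", "B", "B", "B"),
  year   = c(2002, 2002, 2022, 2022, 2002, 2002, 2022, 2022),
  pm     = c(20, 16, 8, 10, 6, 8, 9, 11)
)

# Median daily PM2.5 for each county-year combination (counties x years matrix)
med <- tapply(toy$pm, list(toy$County, toy$year), median)

# Change from 2002 to 2022: negative means a decrease, positive an increase
change <- med[, "2022"] - med[, "2002"]
print(change)
```

Counties that appear in only one year would produce NA in this table, which also flags the coverage gaps noted earlier.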